Statistical computation and visualization (MATH-517)
Modern society has become increasingly dependent on air travel over the past decades. Despite the impressive engineering feat that aviation represents, airplanes and other aircraft sometimes fail, occasionally fatally. Understanding the causes of failure and what can be done to address them is of crucial interest to regulatory bodies and aviation authorities, but also to passengers. The dataset studied here is an extensive collection of incidents involving aircraft of different types across Brazil in the period 2006-2015. Various data have been collected for each incident. In what follows, we set out to investigate common features of these aircraft incidents.
The main goal of this project is to answer the following question: what are the main risk factors that led to a plane crash in Brazil from 2006 to 2015? Similarities between the accidents will be sought. Here are some more specific questions arising from the main problem that we will try to answer:
How are the accidents distributed on a map? Are there areas where accidents are more concentrated?
How do the accidents evolve through time? Is the occurrence rate constant? Are there periods with more accidents, e.g. during school vacation or summer? Do they happen at a certain time of the day?
Does the age of the aircraft play a role? Are old planes more prone to accidents?
What about the characteristics of the aircraft? Do certain features of the aircraft have an influence on the occurrences?
What are the main occurrence types or causes of accidents?
What are the causes that make an accident more serious than another?
To answer our question, we first represented the accidents by state in Brazil to visualize their distribution. To go further, we used \(\textsf{R}\) packages for geocoding (the data contain only place names, not geographic coordinates) to represent each accident on a map.
We then turned to the temporal analysis. For this, we first took a visual approach and then an analytical one, using linear, logistic and quantile regression to study the association between the age of the aircraft, the severity of the accident and the level of damage.
Moreover, we made pie charts to see the share of each category within each characteristic. We also checked whether some columns were highly correlated with each other. To dig further, we made some additional pie charts to understand the multiple causes of occurrences.
Finally, we wanted to find a good prediction of crash severity. To do this, the categorical data were transformed into numerical data by creating dummy variables, and we fitted a decision tree and a logistic regression.
The dataset used in this report comes from Kaggle and is provided by CENIPA (Centro de Investigação e Prevenção de Acidentes Aeronáuticos, Brazilian Open Data). It consists of two files (aircrafts.csv and occurrences.csv). Since they can be merged on the occurrence ID (\(occurrence\_id\)) recorded for each aircraft, the merged file (aircrafts_occurrences_merged.csv) was used to analyse the occurrences.
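The merge on the occurrence ID can be sketched as follows. This is a minimal illustration on invented toy data, not the actual preprocessing; only the column name occurrence_id and the two file names come from the dataset description, and the other values are placeholders.

```r
# Toy stand-ins for aircrafts.csv and occurrences.csv (values are invented).
aircrafts <- data.frame(
  aircraft_id   = c(1, 2, 3),
  occurrence_id = c(10, 10, 11),
  equipment     = c("AIRPLANE", "HELICOPTER", "AIRPLANE")
)
occurrences <- data.frame(
  occurrence_id  = c(10, 11),
  classification = c("ACCIDENT", "SERIOUS INCIDENT")
)

# Inner join: one row per aircraft, with the occurrence columns attached.
merged <- merge(aircrafts, occurrences, by = "occurrence_id")
print(merged)
```

With real files, the two data frames would be read with read.csv() before merging.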
The file contains different information about the aircraft and the accident. First of all, the dataset contains several columns about the features of the aircrafts which encountered accidents:
Then, the dataset also contains columns about the occurrences:
| Column | Missing values |
| occurrence_id | 0 |
| aircraft_id | 0 |
| registration | 0 |
| operator_id | 0 |
| equipment | 0 |
| manufacturer | 110 |
| model | 15 |
| engine_type | 0 |
| engines_amount | 9 |
| takeoff_max_weight..Lbs. | 0 |
| seatings_amount | 18 |
| year_manufacture | 4 |
| registration_country | 0 |
| registration_category | 9 |
| registration_aviation | 0 |
| origin_flight | 1,110 |
| destination_flight | 1,204 |
| operation_phase | 0 |
| type_operation | 0 |
| damage_level | 0 |
| fatalities_amount | 1,688 |
| classification | 0 |
| type.of.occurrence | 0 |
| localization | 0 |
| fu | 2 |
| country | 0 |
| aerodrome | 1,226 |
| time | 0 |
| under_investigation | 0 |
| investigating_command | 0 |
| investigation_status | 0 |
| report_number | 436 |
| published_report | 1,042 |
| recommendation_amount | 0 |
| aircrafts_involved | 0 |
| takeoff | 1,787 |
The dataset contains variables with numerous missing values. In most cases, we will not need these variables. Table 1 displays the number of missing values in each column. \(takeoff\) is the column with the most NAs, with more than 87% of the data missing. We also see a significant share of missing data in the column \(fatalities\_amount\) (82.6%) and in the columns related to the report, such as \(published\_report\) (51%). The data are most complete for columns containing information that is independent of the accident, such as \(aircraft\_id\), \(manufacturer\) and \(type\_operation\). Finally, it is surprising that more than half of the information concerning \(origin\_flight\) and \(destination\_flight\) is missing.
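A table of missing values like Table 1 can be obtained along these lines. This is a sketch on a toy data frame; it assumes missing entries are already coded as NA, which for a real CSV may require the na.strings argument of read.csv().

```r
# Toy data frame with invented values; NA marks a missing entry.
df <- data.frame(
  takeoff = c(NA, NA, NA, "10:00"),
  model   = c("A", NA, "B", "C"),
  time    = c("12:00", "13:00", "14:00", "15:00")
)

na_count <- colSums(is.na(df))                   # missing values per column
na_share <- round(100 * na_count / nrow(df), 1)  # same, in percent

missing_table <- data.frame(missing = na_count, percent = na_share)
missing_table <- missing_table[order(-missing_table$missing), ]  # worst first
print(missing_table)
```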
When looking at the type of equipment, 78% of our data concerns airplanes. They are followed by helicopters, which constitute a little less than 13% of the data. Then come ultralights, with a bit more than 7% of the dataset. The rest are airships, amphibious aircraft or unknown.
There are 120 unique manufacturers represented in our dataset. The top three manufacturers are Indústria Aeronáutica Neiva, Embraer and Aero Boero, with 388, 155 and 126 aircraft respectively.
78% of the aircraft in the dataset are powered by a piston engine, 7% by a turboshaft engine, 6.8% by a turboprop and the rest by jet engines, no engine at all, or unknown engines.
Figure 1: Distribution of the number of engines
According to Figure 1, 73.25% of the aircraft have a single engine and 24% are twin-engine. The rest are equipped with either 3 or 4 engines. We believe that 0 refers to unknown data, although this is not mentioned in the data description. It could also be the value recorded for a glider.
Figure 2: Distribution of take off max weight (in Lbs)
Figure 2 shows the distribution of the maximum takeoff weight (in Lbs). The mean weight is 11,919.2 Lbs (5,406.4 kg). There are, however, very strong outliers, with maximum takeoff weights of up to 630,499 Lbs (285,989.5 kg).
Figure 3: Distribution of manufacturing year
Figure 3 displays the distribution of the manufacturing year. The aircraft represented in the dataset were manufactured between 1936 and 2015. The distribution is trimodal, with high peaks in the late 70s, the early 90s and the mid 00s.
Concerning the registration country, 2000 of the 2043 aircraft in the data are registered in Brazil and 22 in the United States of America. Finally, one each is registered in Germany, France, Poland, Russia, Saudi Arabia, South Africa, Spain and Uruguay.
37% of the accidents concern \(private\) registrations. 18% come from \(instruction\) registration. The rest are either \(aerotaxi\) (13%), \(experimental\) (9.8%), \(agricultural\) (9%), \(regular\) (4%), \(specialized\) (3%) or other registrations.
19% of the accidents registered happened during landing, 17% during takeoff, 11% during cruise and 9% during the run after landing. Accidents also happen during the following phases: during maneuver (4.9%), ascension (4.8%), final approximation (3.4%), descend (3.3%), low altitude navigation (2.8%), traffic circuit (2.5%), taxi (2%), rush on the ground (1.4%) and other less represented phases.
Concerning the severity, 72.6% of the crashes were classified as \(accident\) and the rest as \(serious\) \(incident\). Regarding damage, 58% of the aircraft sustained substantial damage in the crash, 17% were completely destroyed, 12.7% sustained light damage and 8% no damage. The rest is unknown.
Figure 4: Distribution of fatalities
Figure 4 displays the number of fatalities. Most of the crashes involve fewer than 20 deaths. The deadliest crash happened to a regularly scheduled domestic passenger flight from Porto Alegre to São Paulo in 2007, in which 199 people lost their lives during landing. The Airbus A320-233 operating the flight overran runway 35L at São Paulo during moderate rain and crashed into a nearby TAM Express warehouse adjacent to a Shell filling station. The plane exploded on impact, killing all 187 passengers and crew and 12 people on the ground (Wikipedia, 2021d).
Finally, the investigation of 52% of the crashes is finished. 37% are in progress while less than 0.05% of the crashes have had their investigation reopened.
This section is devoted to the study of the places where aircraft accidents occur. Indeed, it is essential to understand this in order to identify more risk factors.
First, a comparison between the Brazilian states allows us to draw a first conclusion (Figure 5). Note that accidents are cumulative over the 10-year period.
Figure 5: Number of accidents from 2006 to 2015 by state in Brazil
All states but one have fewer than 200 accidents, and more precisely fewer than 170, the count of Rio Grande do Sul. The exception is São Paulo. This state is not larger than the others (it is rather medium in size), which indicates that there is clearly a concentration of accidents in this area. Note that ten accidents are not shown on the map (Table 2). Eight of them occurred outside Brazil; the other two are missing for other reasons: one ended up in international waters and the other does not have its location identified.
| Countries | ARGENTINA | BRAZIL | COLOMBIA | ENGLAND | PARAGUAY | PERU | URUGUAY |
| Accidents | 1 | 2 | 1 | 1 | 2 | 1 | 2 |
An explanation for this observation is linked in particular to demography. Indeed, the state of São Paulo is the most populous in Brazil with more than 41 million inhabitants in 2010 (Wikipedia, 2021a). This represents more than 20% of the total population of the country. Air traffic is therefore concentrated there, especially since the largest airports of the country are located in this state (for example São Paulo Guarulhos International Airport and São Paulo Congonhas Airport). The risk of accidents is therefore higher.
To refine the geographical analysis, we can map each accident at its location (nearest city). As the geographic coordinates are not included in the data, we geocode the city names. For this, we used two packages: ggmap and Nominatim. Each requires an API key (Google Maps for ggmap and MapQuest for Nominatim). As the place names are quite specific, geocoding often returns erroneous coordinates, which is why we used two packages and selected the more reasonable of the two results. Some deviation between the city and the place shown on the map may remain, but it should not exceed one degree of longitude and latitude.
In Figure 6, accidents are grouped into clusters that split apart when zooming in. For each accident, information such as the model, the manufacturer, the date or the time is available. Note that 11 more accidents have been removed compared to the previous map because their location could not be identified.
Figure 6: Number of accidents from 2006 to 2015 clustered by location in Brazil
We observe the same pattern, namely a concentration of accidents in the Southeast Region of Brazil. This is explained by the fact that air traffic is densest there, as confirmed in particular by Oliveira et al. (2020). Some cities stand out in terms of the number of accidents. Table 3 lists the ten cities with the most accidents.
| Cities | Rio De Janeiro | São Paulo | Goiânia | Brasília | Manaus | Belo Horizonte | Campo Grande | Londrina | Bragança Paulista | Porto Alegre |
| Accidents | 65 | 50 | 42 | 31 | 29 | 26 | 24 | 23 | 22 | 22 |
A demographic explanation is still reasonable. The first six cities in Table 3 are among the ten most populous cities in Brazil (Wikipedia, 2021b). Moreover, they also have airports on the list of the 20 busiest airports in Brazil (Wikipedia, 2021c).
Looking more closely at the accidents in these cities, we see that all kinds of aircraft are represented, whether they are airliners, tourist planes or even helicopters. To learn more in this direction, it is possible to further explore the accidents on the map (Figure 6).
Figure 7: Distribution of accidents throughout the hours of the day
Figure 7 displays the distribution of accidents over the hours of the day. The distribution is bimodal, with most of the crashes happening around 2pm or around 8pm.
Figure 8: Distribution of accidents throughout days
Figure 8 shows the distribution of accidents over time. We can see an increasing trend, which may be due to the growth of the air transport market and the development of aircraft. We can also observe a seasonal pattern in the graph, which may be due to the fact that there are usually more flights during the summer.
Figure 9: Distribution of time difference between the occurrence and the publication date
Figure 9 displays the distribution of the time difference, in days, between the occurrence date and the publication date. Most of the crashes are reported within the same year. The longest time difference is 3589 days. It concerns an occurrence in Brazil in 2006, a loss of control in the air during a political operation.
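The time difference in days can be computed directly from two date columns. The sketch below uses invented dates, and the variable names occurrence_date and publication_date are hypothetical rather than actual column names of the dataset.

```r
# Hypothetical date vectors (invented values), parsed as Date objects.
occurrence_date  <- as.Date(c("2006-03-14", "2010-07-01"))
publication_date <- as.Date(c("2006-09-30", "2012-07-01"))

# Subtracting Date objects yields a difference in days.
delay_days <- as.numeric(publication_date - occurrence_date)
print(delay_days)

# hist(delay_days) would then give a plot in the spirit of Figure 9.
```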
In this section, we consider whether there is any association between the age of an aircraft and the severity and damage level of the incident. To begin, we plot the cumulative distribution function of the age of the aircraft in Figure 10 below.
Figure 10: Cumulative distribution function of the aircraft’s lifetime, denoted by \(X\)
We observe that more than 80% of the aircraft are between 0 and 40 years old at the time of the incident, with a small number of aircraft reaching nearly 80 years.
Next, we provide scatter plots to inspect visually the association between age and damage level (Figure 11) and between aircraft lifetime and accident severity (Figure 12).
Figure 11: Scatter plot of damage level (coded on a 0-3 scale) versus aircraft lifetime in recorded incidents
Figure 12: Scatter plot of severity versus aircraft lifetime in recorded incidents
To investigate the marginal association between age and damage level, we perform a regression analysis using a linear model, a logistic model and quantile regression.
The summary of the linear regression of damage level on the lifetime of the aircraft is
Call:
lm(formula = df.damage.age$damage_level ~ df.damage.age$lifetime)
Residuals:
Min 1Q Median 3Q Max
-1.8694 0.1309 0.1334 0.1356 1.1392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8693569 0.0341994 54.661 <2e-16 ***
df.damage.age$lifetime -0.0001235 0.0012004 -0.103 0.918
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7949 on 1884 degrees of freedom
Multiple R-squared: 5.621e-06, Adjusted R-squared: -0.0005252
F-statistic: 0.01059 on 1 and 1884 DF, p-value: 0.918
Here, we have coded the damage level on a scale of 0 to 3. The 95% confidence intervals of the coefficients are given by
beta 2.5 % 97.5 %
(Intercept) 1.8693568526 1.802284164 1.936429541
df.damage.age$lifetime -0.0001235299 -0.002477756 0.002230697
The interval for the slope is narrow and includes zero, the null value of the slope.
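A fit of this kind, with its confidence intervals, can be sketched as follows. The data are simulated, so the numbers will not match the output above; the column names damage_level and lifetime mirror those in the printed call, and passing the data frame through the data argument is a stylistic choice of this sketch.

```r
set.seed(517)
n <- 200

# Simulated data: damage level (0-3) drawn independently of lifetime.
toy <- data.frame(lifetime = runif(n, 0, 60))
toy$damage_level <- sample(0:3, n, replace = TRUE)

fit <- lm(damage_level ~ lifetime, data = toy)

ci <- confint(fit, level = 0.95)       # 2.5 % and 97.5 % bounds
print(cbind(beta = coef(fit), ci))

# Does the interval for the slope contain the null value 0?
slope_ci <- ci["lifetime", ]
includes_zero <- slope_ci[1] <= 0 && 0 <= slope_ci[2]
print(includes_zero)
```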
Likewise, the logistic regression of severity on aircraft lifetime (with the corresponding odds ratios and confidence intervals) is given below:
Call:
glm(formula = classification ~ lifetime, family = "binomial",
data = df.damage.age)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8589 -0.7911 -0.7661 1.5910 1.6827
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.13355 0.09886 -11.466 <2e-16 ***
lifetime 0.00413 0.00342 1.208 0.227
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2171.2 on 1885 degrees of freedom
Residual deviance: 2169.8 on 1884 degrees of freedom
AIC: 2173.8
Number of Fisher Scoring iterations: 4
(Intercept) lifetime
0.321890 1.004138
OR 2.5 % 97.5 %
(Intercept) 0.321890 0.2646635 0.390004
lifetime 1.004138 0.9974175 1.010886
Once again, the confidence interval includes the null value, which is 1 on the odds-ratio scale. Finally, we consider the quantile regression model \[ Q_Y(\tau\mid X) = a_0(\tau) + b_0(\tau)X \] where the outcome \(Y\) is the aircraft lifetime and the exposure \(X\) is the accident severity. The resulting regression coefficient is plotted in Figure 13.
Figure 13: Quantile regression of damage level on the quantile of aircraft lifetime
The quantile plot shows that the conditional distribution of lifetime given severity level “serious incident” is narrower than the conditional distribution given severity level “accident.” In other words, both high and low quantiles are shifted towards the median.
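When the exposure \(X\) is binary, the coefficient \(b_0(\tau)\) of the model above equals the difference between the \(\tau\)-quantiles of the two groups, which gives a simple way to read such a plot. The sketch below illustrates this on simulated data using group-wise sample quantiles in base R; it does not reproduce the fit behind Figure 13, and a dedicated quantile regression routine may differ slightly through its sample-quantile convention. The variable names are hypothetical.

```r
set.seed(517)

# Simulated lifetimes: the "serious incident" group (1) is built to be narrower.
toy <- data.frame(
  serious_incident = rep(c(0, 1), each = 100),
  lifetime = c(runif(100, 0, 40), runif(100, 10, 30))
)

taus <- c(0.1, 0.25, 0.5, 0.75, 0.9)
q0 <- quantile(toy$lifetime[toy$serious_incident == 0], probs = taus)
q1 <- quantile(toy$lifetime[toy$serious_incident == 1], probs = taus)

# Estimate of b_0(tau): difference of conditional quantiles.
b0_hat <- q1 - q0
print(round(cbind(a0 = q0, b0 = b0_hat), 2))
```

A narrower group shows up as a positive coefficient at low quantiles and a negative one at high quantiles, which is the pattern described above.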
As we did not find any strong marginal association between damage level and age of the aircraft in the population, we examined whether such associations could exist within subsets of the population, such as the stratum of incidents involving helicopters. Once again, we perform a logistic regression, which yields the following fit summary and confidence intervals:
Call:
glm(formula = classification ~ lifetime, family = "binomial",
data = df.helicopters)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8589 -0.7911 -0.7661 1.5910 1.6827
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.13355 0.09886 -11.466 <2e-16 ***
lifetime 0.00413 0.00342 1.208 0.227
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2171.2 on 1885 degrees of freedom
Residual deviance: 2169.8 on 1884 degrees of freedom
AIC: 2173.8
Number of Fisher Scoring iterations: 4
(Intercept) lifetime
0.321890 1.004138
OR 2.5 % 97.5 %
(Intercept) 0.321890 0.2646635 0.390004
lifetime 1.004138 0.9974175 1.010886
We do not find a strong association between accident severity and aircraft age within the stratum of helicopters either: the confidence interval for the lifetime coefficient includes the null value, and the p-value of the coefficient is 0.227.
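The odds ratios and their intervals in the outputs above are obtained by exponentiating the coefficients of the logistic fit. The sketch below shows this on simulated data; it uses Wald intervals from confint.default(), which is an assumption, since the report does not state which interval method was used.

```r
set.seed(517)
n <- 500

# Simulated data: classification (1 = serious incident) independent of lifetime.
toy <- data.frame(lifetime = runif(n, 0, 60))
toy$classification <- rbinom(n, 1, 0.25)

fit <- glm(classification ~ lifetime, family = binomial, data = toy)

or    <- exp(coef(fit))              # odds ratios
or_ci <- exp(confint.default(fit))   # Wald 95% intervals on the OR scale
print(cbind(OR = or, or_ci))

# Sanity check against the report: exp(0.00413) is the reported OR 1.004138.
print(exp(0.00413))
```

On the odds-ratio scale the null value is 1, so an interval such as (0.997, 1.011) is compatible with no association.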
The dataset describes different features of the aircraft, and some categories of these features might account for most of the accidents. An interactive plot of pie charts showing the counts for the different features was made to compare them and determine which categories have the most occurrences. Here is the output for the variables \(equipment\), \(manufacturer\), \(model\), \(engine\_type\), \(engines\_amount\), \(registration\_aviation\), \(operation\_phase\), \(type\_operation\).
Figure 14: The shares of accidents coming from the same type of aircraft
Figure 15: The shares of accidents coming from the same manufacturer
Figure 16: The shares of accidents coming from the same aircraft model
Figure 17: The shares of accidents coming from the same type of aircraft engine
Figure 18: The shares of accidents coming from the same amount of engine
Figure 19: The shares of accidents coming from the same registration of aviation
Figure 20: The shares of accidents coming from the same operation phase
Figure 21: The shares of accidents coming from the same type of operation
Figures 14 to 21 are pie charts of the shares of the different aircraft features. First, some features have one dominant category: the type of aircraft in Figure 14 (airplane), the type of engine in Figure 17 (piston) and the number of engines in Figure 18 (1). Second, some features are spread over more categories but still have an apparent main one: the aviation registration in Figure 19 (private) and the type of operation in Figure 21 (private). Last, some features are shared among several categories, such as the operation phase in Figure 20, where landing and takeoff are the phases in which accidents occurred most often. The manufacturer and the model (Figures 15 and 16) are also spread over many categories, but the shares are roughly equal only among some of the manufacturers, whereas they are roughly equal among almost all of the models. The shares of the models are more finely divided: setting aside the models with fewer than 20 occurrences, the amount sums to almost 60%.
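The shares behind such pie charts are simple relative frequencies. A minimal sketch on an invented vector of equipment types:

```r
# Invented sample of equipment types.
equipment <- c("AIRPLANE", "AIRPLANE", "HELICOPTER", "AIRPLANE", "ULTRALIGHT")

# Relative frequency of each category, in percent, largest first.
shares <- sort(100 * prop.table(table(equipment)), decreasing = TRUE)
print(round(shares, 1))

# pie(shares) would draw the corresponding static pie chart.
```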
Although certain categories of aircraft features account for the most accidents, we cannot say whether they are more likely to be involved in accidents than others, since we do not know their share among all flights. It is quite possible that the dominant categories are simply the most used ones and that their accident rate is no higher than that of the others. It is therefore hard to conclude whether certain features are more dangerous to passengers. We can still say that the most dangerous operation phases are landing and takeoff, since all aircraft go through all the phases during operation.
The correlation between the different features is interesting to observe, since it may allow us to omit certain columns when analysing which types of aircraft features encountered the most accidents. However, there are some subtleties in analysing the correlation because the variables are categorical. As most of the variables are not ordered, we selected several pairs of columns suspected to be correlated.
The aim is to see whether, of two chosen columns, one is a refinement of the other. For instance, we want to check whether aircraft of the same model come from the same manufacturer, in other words, whether the column \(model\) is a refinement of the column \(manufacturer\). Mainly, we want to check whether each aircraft model has fixed features, because if so, those features can be omitted and the model alone carries enough information for the analysis.
To check this, let A and B be the two columns. For each category of B, we count the occurrences of each category of A; all counts except the largest one are considered outliers, and the numbers of outliers are summed over the categories of B. We made several hypotheses and checked whether they were correct:
| A | equipment | manufacturer | takeoff max weight (Lbs) | seatings amount | engines amount | engine type | registration aviation |
| B | model | model | model | model | model | model | model |
| Number_of_outliers | 7 | 92 | 173 | 211 | 5 | 8 | 392 |
Table 4 shows the number of outliers computed for each pair of columns. On the one hand, several columns are indeed highly correlated with the model, such as the type of aircraft, the number of engines and the type of engine. On the other hand, some columns often do not match the model: \(registration\_aviation\) has the most outliers of all. We conclude that although some columns can be omitted, others remain important to consider. For instance, almost 20% of the cases are outliers for \(registration\_aviation\), about 10% for \(takeoff\_max\_weight\) (Lbs) and \(seatings\_amount\), and about 5% for \(manufacturer\). It appears that aircraft users quite often changed some features of the original aircraft model.
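The outlier count described above can be written compactly with a contingency table. The following is a sketch of that counting rule on invented data; the function name count_outliers is hypothetical.

```r
# For each category of b, every row not in the most frequent category of a
# counts as an outlier; the outliers are summed over the categories of b.
count_outliers <- function(a, b) {
  tab <- table(b, a)                     # rows: categories of B, columns: of A
  sum(rowSums(tab) - apply(tab, 1, max))
}

# Invented example: model M1 appears with two different manufacturers.
toy <- data.frame(
  model        = c("M1", "M1", "M1", "M2", "M2", "M3"),
  manufacturer = c("X",  "X",  "Y",  "Z",  "Z",  "X")
)

n_out <- count_outliers(toy$manufacturer, toy$model)
print(n_out)
```

A count of zero would mean that A is fully determined by B, so that A could be dropped once B is known.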
The maximum takeoff weight and the number of available seats of the aircraft were also analysed. Scatter plots are used here, since the values of these variables are ordered.
Figure 22: Number of accidents by maximum takeoff weight
Figure 22 is a scatter plot of the number of accidents for each maximum takeoff weight. Most of the accidents involve aircraft with a low maximum takeoff weight.
Figure 23: Number of accidents by number of available seats
Figure 23 is a scatter plot of the number of accidents for each number of available seats. Again, most of the accidents involve aircraft with few available seats.
As a result, we cannot draw conclusions about the likelihood of occurrences from the features of the aircraft, but our observations show that most of the occurrences involve private (Figures 19 and 21) and small (Figures 22 and 23) aircraft.
Figure 24: Distribution of the type of occurrences
As Figure 24 shows, an accident can have many causes, and working with all of them directly would be difficult. One idea is to group the causes into categories. We chose to divide them into causes that occur during takeoff, during flight, during landing or on the ground. Causes that can occur at any time are designated as unknown.
Figure 25: Categorisation of the multiple type of occurrences
As Figure 25 shows, most of the occurrences happen during the flight. Surprisingly, many occurrences happen on the ground and a negligible number during takeoff.
We now study the severity of the accident depending on the causes. To do so, we focus on two columns: the classification of the occurrence (serious incident or accident) and the damage level of the aircraft. First, let us see how the classification varies with the type of occurrence. An accident is an occurrence in which the flight has been stopped, whereas a serious incident is an incident involving circumstances that indicate a high probability of an accident.
Figure 26: Proportion of the classification depending on the type of occurrences
Most of the occurrences are accidents. Among the occurrences that happened during landing, 53% are serious incidents. To summarize, we plot the proportions of the types of occurrences for each classification.
Figure 27: Proportion of types of occurrences for each classification
Notice that, among serious incidents, the proportions of the categories are approximately the same. Among accidents, more than half occur during the flight. The proportion of takeoff is very small for both classifications, which comes from the small number of occurrences during takeoff.
Figure 28: Pie Charts of the damage level depending on the classification
As we can see, serious incidents more often involve light or no damage and never a destroyed aircraft, whereas light and no damage make up a small share of the damage in accidents. Damage in accidents is more severe: approximately 25% of the aircraft are destroyed in an accident compared to 0% in serious incidents. Light damage still represents around 10% of accidents, which may be explained by accidents occurring on the ground and causing less damage. We now plot the proportion of each damage level for each category.
Figure 29: Pie Charts of the damage level depending on the type of occurrences
In every category, the damage is mainly substantial. The least serious damage occurs during takeoff (30%), during landing (19%) and on the ground (15%). The aircraft has the highest probability of being destroyed (26%) if the accident happens during the flight. We conclude that accidents occurring during takeoff, during landing and on the ground are the least serious ones: this is when the probability of getting away with light or no damage is highest and the probability of the aircraft being destroyed is lowest (2% for accidents occurring during landing).
To predict severe accidents versus minor incidents and to analyse the possible causes, we create dummy variables for each categorical variable. For example, since one of the operation types is INSTRUCTION, we create a new column called type_operation_INSTRUCTION that takes the value 1 if the type is INSTRUCTION and 0 otherwise.
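One way to build such dummy columns in base R is with model.matrix(). This is a sketch on invented values, and the report does not say which function was actually used; the renaming step only reproduces the type_operation_INSTRUCTION naming pattern.

```r
# Invented operation types.
toy <- data.frame(
  type_operation = factor(c("INSTRUCTION", "PRIVATE", "AEROTAXI", "INSTRUCTION"))
)

# One 0/1 column per level ("- 1" removes the intercept so no level is dropped).
dummies <- model.matrix(~ type_operation - 1, data = toy)

# Rename columns to the pattern <variable>_<LEVEL>.
colnames(dummies) <- sub("^type_operation", "type_operation_", colnames(dummies))
print(dummies)
```

For regression models, one level per variable is usually left out as the reference category to avoid perfectly collinear columns.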
After creating the dummy variables, we feed the dataset to our decision tree model and obtain the result shown in Figure 30.
Figure 30: Decision tree for severity of accident
When the engines_amount is greater than 1.5 and the takeoff_max_weight..Lbs. is larger than 14,770.5, the accident will be more likely to be severe.
Figure 31: Cross validation for the optimal depth of decision tree
By performing cross-validation, we find that the depth-2 tree achieves the best mean cross-validation accuracy, 74.949 +/- 1.623%.
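The cross-validated accuracy can be estimated with a k-fold loop of the following shape. To stay within base R, this sketch swaps in a logistic regression for the decision tree and uses simulated data, so it illustrates only the procedure; reading the "+/-" as a standard deviation over folds is also an assumption. To compare depths, the same loop would be repeated for each candidate tree depth.

```r
set.seed(517)
n <- 300

# Simulated data: severity is more likely when there is more than one engine.
toy <- data.frame(engines_amount = sample(1:4, n, replace = TRUE))
toy$severe <- rbinom(n, 1, ifelse(toy$engines_amount > 1.5, 0.7, 0.2))

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random folds of equal size

# Accuracy on each held-out fold (stand-in classifier: logistic regression).
acc <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- glm(severe ~ engines_amount, family = binomial, data = train)
  pred  <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$severe)
})

cat(sprintf("accuracy: %.3f +/- %.3f\n", mean(acc), sd(acc)))
```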
Logistic regression shows that increases in engines amount, occurrence year, registration category EXT, registration category PIN, registration category PRI, registration category TPR, operation phase RUN AFTER LANDING or fu PA increase the log odds of a severe incident, while increases in engine type PISTON, operation phase FINAL APPROXIMATION, operation phase MANEUVER or operation phase Others decrease the log odds of a severe incident (Kshitiz Sirohi, 2018).
Figure 32: Check linearity assumption for logistic regression
As shown in Figure 32, most of the variables satisfy the linearity assumption, except that registration_category_PRI and operation_phase_RUN.AFTER.LANDING fit it poorly and might require further data transformation.
| Variable | VIF |
| engines_amount | 1.52 |
| seatings_amount | 4.10 |
| occurrence_year | 1.07 |
| engine_type_JET | 2.04 |
| engine_type_PISTON | 1.57 |
| registration_category_ADE | 1.06 |
| registration_category_EXT | 2.37 |
| registration_category_Others | 1.23 |
| registration_category_PET | 1.15 |
| registration_category_PIN | 1.05 |
| registration_category_PRI | 1.46 |
| registration_category_TPR | 3.43 |
| registration_aviation_UNKNOWN | 2.61 |
| operation_phase_ASCENSION | 1.09 |
| operation_phase_CRUISE | 1.16 |
| operation_phase_DESCEND | 1.08 |
| operation_phase_FINAL.APPROXIMATION | 1.07 |
| operation_phase_MANEUVER | 1.14 |
| operation_phase_Others | 1.14 |
| operation_phase_RUN.AFTER.LANDING | 1.14 |
| fu_AM | 1.11 |
| fu_GO | 1.07 |
| fu_MG | 1.07 |
| fu_PA | 1.08 |
| fu_PR | 1.08 |
| fu_RS | 1.11 |
To satisfy the no-multicollinearity assumption, we manually removed the variables with a VIF greater than 10 (Kassambara, 2018). Table 5 gives the VIF values of the remaining variables, all of which are below 10.
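For reference, the classical variance inflation factor of predictor \(j\) is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated predictors; the function name vif_manual is hypothetical, and packaged VIF routines applied to a logistic model may use a slightly different definition.

```r
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(517)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(n)                  # unrelated to the others

vifs <- vif_manual(data.frame(x1, x2, x3))
print(round(vifs, 2))
```

Here x1 and x2 would be flagged by the threshold of 10, while x3 stays close to 1.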
Figure 33: Check no influential observations assumption for logistic regression
Figure 33 shows the Cook’s distance of each observation. The three largest values are labelled, and the corresponding observations deserve further exploration.
# A tibble: 3 x 34
classification engines_amount seatings_amount occurrence_year
<dbl> <dbl> <dbl> <dbl>
1 0 3 301 2006
2 1 0 0 2009
3 0 3 0 2012
# ... with 30 more variables: engine_type_JET <dbl>,
# engine_type_PISTON <dbl>, registration_category_ADE <dbl>,
# registration_category_EXT <dbl>,
# registration_category_Others <dbl>,
# registration_category_PET <dbl>, registration_category_PIN <dbl>,
# registration_category_PRI <dbl>, registration_category_TPR <dbl>,
# registration_aviation_UNKNOWN <dbl>, ...
The three most extreme observations are shown above.
Figure 34: Plot out the standard residuals
To detect outliers, we search for points whose standardized residual is greater than 3 in absolute value. As displayed in Figure 34, there are no such points in our dataset, hence there are no influential observations.
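Both diagnostics are available in base R for a fitted glm. The sketch below runs them on simulated data with one artificially extreme row, so its output is unrelated to Figures 33 and 34; the column names are borrowed from the dataset for readability only.

```r
set.seed(517)
n <- 200

# Simulated data with one artificially extreme seat count in row 1.
toy <- data.frame(seatings_amount = rpois(n, 6))
toy$classification <- rbinom(n, 1, 0.3)
toy$seatings_amount[1] <- 60

fit <- glm(classification ~ seatings_amount, family = binomial, data = toy)

# Cook's distance: row indices of the three most influential observations.
cd   <- cooks.distance(fit)
top3 <- order(cd, decreasing = TRUE)[1:3]
print(toy[top3, ])

# Standardized residuals: rows beyond the |3| threshold, if any.
std_res <- rstandard(fit)
flagged <- which(abs(std_res) > 3)
print(length(flagged))
```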
We can assume that the independence assumption is satisfied because each observation is an individual airplane crash.
We would like to perform DBSCAN or k-means clustering on our dataset. However, the current dataset has 84 columns, so it is necessary to reduce the dimension first.
Figure 35: Principal component analysis on the dataset
The PCA result in Figure 35 shows that keeping 7 dimensions captures only around 40% of the variance, so the curse of dimensionality might still be an issue. Hence it is not sensible to pursue the investigation with clustering. DBSCAN did not give a good result either and is therefore not included.
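The cumulative share of variance captured by the first components can be read off a PCA fit as follows. This is a sketch on a simulated matrix of 0/1 dummy columns; whether the original analysis centred and scaled the columns is an assumption.

```r
set.seed(517)

# Simulated matrix of 10 independent dummy (0/1) columns, 200 rows.
X <- matrix(rbinom(200 * 10, 1, 0.3), nrow = 200)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(round(cum_var, 3))

# Smallest number of components capturing at least 80% of the variance.
n_needed <- which(cum_var >= 0.8)[1]
print(n_needed)
```

When the dummy columns are close to independent, as here, the variance is spread almost evenly and few components can be discarded.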
By visualizing the accidents on a map, we found that there is a concentration of accidents in the state of São Paulo, or more generally in the whole Southeast Region of Brazil. The possible explanations are linked to demography or the high density of air traffic. As air activity is higher in these areas, the risk of an accident increases.
We then employed linear, logistic and quantile regression to investigate the association between aircraft lifetime, accident severity and damage level, but did not find any statistically significant association. We also analysed the different features of the aircraft. Although it is hard to tell whether certain types of features are more prone to accidents, we could conclude that most of the accidents involve small aircraft used for private purposes. Moreover, certain features are mostly determined by the model of the aircraft, so they can be omitted.
Overall, for predicting accident severity, given the small \(R^2\) of the logistic regression and the relatively satisfying 75% accuracy of the decision tree, we prefer the decision tree over logistic regression as our predictive model. In the decision tree, engines_amount is the major factor, and type_operation_INSTRUCTION and takeoff_max_weight..Lbs. are the secondary factors for predicting the severity of an accident.
Finally, our analyses and research allowed us to answer our initial questions. We have looked at several potential factors that can increase the risk of accidents, and now have a better understanding of them, especially in Brazil.
Currently, we have too many categorical variables in our dataset. Because the machine learning techniques we used mainly work with numerical variables, we had to transform the categorical variables into numerical ones by adding many dummy variables. After this transformation, however, the dimension becomes extremely large. We would like to perform clustering, but we have to reduce the dimension first, and the PCA result shows that we cannot reduce the dimension sufficiently while retaining most of the information in the dataset. Hence we require a larger dataset with more observations. We currently use ten years of Brazilian aeronautical accidents, so the dataset could be enlarged by extending the investigation to a longer period and to more countries.